{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Secure Data Science on AWS\n",
    "\n",
    "The most common security considerations for building secure data science projects in the cloud touch the areas of compute and network isolation, authentication and authorization, data encryption, artifact management, auditability, monitoring and governance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "\n",
    "region = boto3.Session().region_name\n",
    "session = boto3.session.Session()\n",
    "\n",
    "ec2 = boto3.Session().client(service_name='ec2', region_name=region)\n",
    "sm = boto3.Session().client(service_name='sagemaker', region_name=region)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Retrieve the Notebook Instance Name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Notebook Instance Name: ds-notebook-dev-dev-antjebar\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "notebook_instance_name = None\n",
    "\n",
    "try:\n",
    "    with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:\n",
    "        data = json.load(notebook_info)\n",
    "        resource_arn = data['ResourceArn']\n",
    "        region = resource_arn.split(':')[3]\n",
    "        notebook_instance_name = data['ResourceName']\n",
    "    print('Notebook Instance Name: {}'.format(notebook_instance_name))\n",
    "except:\n",
    "    print('+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')\n",
    "    print('[ERROR]: COULD NOT RETRIEVE THE NOTEBOOK INSTANCE METADATA.')\n",
    "    print('+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compute and Network Isolation \n",
    "\n",
    "This SageMaker notebook instance has been set up **without** Internet access. The notebook instance runs within a VPC without Internet connectivity but still maintains access to specific AWS services such as Elastic Container Registry and Amazon S3.  Access to a shared services VPC has also been configured to allow connectivity to a centralized repository of Python packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'NotebookInstanceArn': 'arn:aws:sagemaker:us-east-1:806570384721:notebook-instance/ds-notebook-dev-dev-antjebar', 'NotebookInstanceName': 'ds-notebook-dev-dev-antjebar', 'NotebookInstanceStatus': 'InService', 'Url': 'ds-notebook-dev-dev-antjebar.notebook.us-east-1.sagemaker.aws', 'InstanceType': 'ml.t3.medium', 'SubnetId': 'subnet-03f68f0599a0db188', 'SecurityGroups': ['sg-0559ec22f560603ea'], 'RoleArn': 'arn:aws:iam::806570384721:role/service-role/ds-notebook-role-dev-dev-antjebar', 'KmsKeyId': '4a1e174c-4fd1-484d-9205-33527bc34f2f', 'NetworkInterfaceId': 'eni-0cf80cd4d6dc6a544', 'LastModifiedTime': datetime.datetime(2020, 12, 16, 1, 30, 58, 708000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2020, 12, 16, 1, 27, 2, 530000, tzinfo=tzlocal()), 'NotebookInstanceLifecycleConfigName': 'ds-notebook-lc-dev-dev', 'DirectInternetAccess': 'Disabled', 'VolumeSizeInGB': 10, 'DefaultCodeRepository': 'https://git-codecommit.us-east-1.amazonaws.com/v1/repos/ds-source-dev-dev', 'RootAccess': 'Enabled', 'ResponseMetadata': {'RequestId': '6758a481-7fe0-49b8-8c8f-8887bcd36604', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '6758a481-7fe0-49b8-8c8f-8887bcd36604', 'content-type': 'application/x-amz-json-1.1', 'content-length': '876', 'date': 'Wed, 16 Dec 2020 19:40:38 GMT'}, 'RetryAttempts': 0}}\n"
     ]
    }
   ],
   "source": [
    "response = sm.describe_notebook_instance(\n",
    "        NotebookInstanceName=notebook_instance_name\n",
    ")\n",
    "\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Review The Following Settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SubnetId: subnet-03f68f0599a0db188\n",
      "SecurityGroups: ['sg-0559ec22f560603ea']\n",
      "IAM Role: arn:aws:iam::806570384721:role/service-role/ds-notebook-role-dev-dev-antjebar\n",
      "NetworkInterfaceId: eni-0cf80cd4d6dc6a544\n",
      "DirectInternetAccess: Disabled\n"
     ]
    }
   ],
   "source": [
    "print('SubnetId: {}'.format(response['SubnetId']))\n",
    "print('SecurityGroups: {}'.format(response['SecurityGroups']))\n",
    "print('IAM Role: {}'.format(response['RoleArn']))\n",
    "print('NetworkInterfaceId: {}'.format(response['NetworkInterfaceId']))\n",
    "print('DirectInternetAccess: {}'.format(response['DirectInternetAccess']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = sm.describe_notebook_instance(\n",
    "    NotebookInstanceName='string'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Verify That Internet Access Is Disabled\n",
    "\n",
    "Expected result: \n",
    "You should see a timeout without a path to the Internet or a proxy server.  \n",
    "```Failed to connect to aws.amazon.com port 443: Connection timed out```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "^C\n"
     ]
    }
   ],
   "source": [
    "!curl https://www.datascienceonaws.com/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By removing public internet access in this way, we have created a secure environment where all the dependencies are installed, but the notebook now has no way to access the internet, and internet traffic cannot reach the notebook either. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Authentication and Authorization\n",
    "\n",
    "SageMaker notebooks need to be assigned a role for accessing AWS services. Fine grained access control over which services a SageMaker notebook is allowed to access can be provided using Identity and Access Management (IAM). \n",
    "\n",
    "To control access at a user level, data scientists should typically not be allowed to create notebooks, provision or delete infrastructure. In some cases, even console access can be removed by creating PreSigned URLs, that directly launch a hosted Jupyter environment for data scientists to use from their laptops. \n",
    "\n",
    "Moreover, admins can use resource [tags for attribute-based access control (ABAC)](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_attribute-based-access-control.html) to ensure that different teams of data scientists, with the same high-level IAM role, have different access rights to AWS services, such as only allowing read/write access to specific S3 buckets which match tag criteria. \n",
    "\n",
    "For customers with even more stringent data and code segregation requirements, admins can provision different accounts for individual teams and manage the billing from these accounts in a centralized Organizational Unit. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Review IAM Role and Region For This Notebook Instance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "\n",
    "sess   = sagemaker.Session()\n",
    "role = sagemaker.get_execution_role()\n",
    "region = boto3.Session().region_name\n",
    "\n",
    "sm = boto3.Session().client(service_name='sagemaker', region_name=region)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "IAM Role: arn:aws:iam::806570384721:role/service-role/ds-notebook-role-dev-dev-antjebar\n",
      "Region: us-east-1\n"
     ]
    }
   ],
   "source": [
    "print(\"IAM Role: {}\".format (role))\n",
    "print(\"Region: {}\".format(region))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TODO: List IAM Role and Policies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grant `Least Privilege` for IAM Roles and Policies\n",
    "\n",
    "IAM roles and policies help you control access to AWS resources. You create policies which define permissions, and attach the policies to IAM users, groups of users, or roles. \n",
    "\n",
    "Policies types include identity-based and resource-based policies among others. Identity-based policies are tied to an identity, such as IAM users or roles. In contrast, resource-based policies are attached to a resource such as an Amazon S3 bucket. \n",
    "\n",
    "Here is a sample policy attached to the SageMaker notebook instance IAM role. This policy restricts the IAM Role to one specific S3 bucket. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TODO: Attach this policy to IAM Role"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "{\n",
    "    \"Version\": \"2012-10-17\",\n",
    "    \"Statement\": [\n",
    "        {\n",
    "            \"Action\": [\n",
    "                \"s3:GetObject\",\n",
    "                \"s3:PutObject\",\n",
    "                \"s3:ListObject\"\n",
    "            ],\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Resource\": [\n",
    "                \"arn:aws:s3:::sagemaker-us-east-1-123456789-secure\",\n",
    "                \"arn:aws:s3:::sagemaker-us-east-1-123456789-secure/*\"\n",
    "            ]\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Let's try to copy data to this S3 bucket"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false
   },
   "outputs": [],
   "source": [
    "!echo s3://$bucket-secure/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 cp security.ipynb s3://$bucket-secure/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's try to copy data over to a different S3 bucket!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 cp ./security.ipynb s3://$bucket/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train the Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train Without a VPC Configured\n",
    "\n",
    "To test the networking controls, run the following cell below. Here you will first attempt to train the model without an associated network configuration. You should see that the training job is stopped around the same time as the \"Downloading - Downloading input data\" message is emitted. \n",
    "\n",
    "#### Detective control explained\n",
    "\n",
    "The training job was terminated by an AWS Lambda function that was executed in response to a CloudWatch Event that was triggered when the training job was created. \n",
    "\n",
    "To learn more about how the detective control does this, assume the role of the Data Science Administrator and review the code of the [AWS Lambda function SagemakerTrainingJobVPCEnforcer](https://console.aws.amazon.com/lambda/home?#/functions/SagemakerTrainingJobVPCEnforcer?tab=configuration). \n",
    "\n",
    "You can also review the [CloudWatch Event rule SagemakerTrainingJobVPCEnforcementRule](https://console.aws.amazon.com/cloudwatch/home?#rules:name=SagemakerTrainingJobVPCEnforcementRule) and take note of the event which triggers execution of the Lambda function.\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:sagemaker.deprecations:The method get_image_uri has been renamed in sagemaker>=2.\n",
      "See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.\n",
      "INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.\n",
      "INFO:sagemaker.image_uris:Defaulting to only available Python version: py3\n",
      "INFO:sagemaker.image_uris:Defaulting to only supported image scope: cpu.\n"
     ]
    }
   ],
   "source": [
    "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
    "image = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "ename": "AttributeError",
     "evalue": "module 'sagemaker' has no attribute 's3_input'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-37-9625c26ea5c9>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0ms3_input_train\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msagemaker\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0ms3_input\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms3_data\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m's3://sagemaker-workshop-cloudformation-{}/quickstart/train_data.csv'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mregion\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcontent_type\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'csv'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      2\u001b[0m \u001b[0ms3_input_test\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msagemaker\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0ms3_input\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ms3_data\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m's3://sagemaker-workshop-cloudformation-{}/quickstart/test_data.csv'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mregion\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcontent_type\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'csv'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0mprint\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m\"Training data at: {}\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0ms3_input_train\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconfig\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'DataSource'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'S3DataSource'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'S3Uri'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0mprint\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m\"Test data at: {}\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0ms3_input_test\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconfig\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'DataSource'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'S3DataSource'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'S3Uri'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mAttributeError\u001b[0m: module 'sagemaker' has no attribute 's3_input'"
     ]
    }
   ],
   "source": [
    "s3_input_train = sagemaker.s3_input(s3_data='s3://sagemaker-workshop-cloudformation-{}/quickstart/train_data.csv'.format (region), content_type='csv')\n",
    "s3_input_test = sagemaker.s3_input(s3_data='s3://sagemaker-workshop-cloudformation-{}/quickstart/test_data.csv'.format (region), content_type='csv')\n",
    "print (\"Training data at: {}\".format (s3_input_train.config['DataSource']['S3DataSource']['S3Uri']))\n",
    "print (\"Test data at: {}\".format (s3_input_test.config['DataSource']['S3DataSource']['S3Uri']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "xgb = sagemaker.estimator.Estimator(\n",
    "    image,\n",
    "    role,\n",
    "    train_instance_count=1,\n",
    "    train_instance_type='ml.m4.xlarge',\n",
    "    train_max_run=3600,\n",
    "    output_path='s3://{}/{}/models'.format(output_bucket, prefix),\n",
    "    sagemaker_session=sess,\n",
    "    train_use_spot_instances=True,\n",
    "    train_max_wait=3600,\n",
    "    encrypt_inter_container_traffic=False\n",
    ")  \n",
    "\n",
    "xgb.set_hyperparameters(\n",
    "    max_depth=5,\n",
    "    eta=0.2,\n",
    "    gamma=4,\n",
    "    min_child_weight=6,\n",
    "    subsample=0.8,\n",
    "    verbosity=0,\n",
    "    objective='binary:logistic',\n",
    "    num_round=100)\n",
    "\n",
    "xgb.fit(inputs={'train': s3_input_train})\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train with VPC\n",
    "\n",
    "This time provide the training job with the network settings that were defined above. This time we shouldn't see the **Client Error** as before!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/'.format(data_bucket, traindataprefix), content_type='csv')\n",
    "s3_input_test = sagemaker.s3_input(s3_data='s3://{}/{}/'.format(data_bucket, testdataprefix), content_type='csv')\n",
    "print (\"Training data at: {}\".format (s3_input_train.config['DataSource']['S3DataSource']['S3Uri']))\n",
    "print (\"Test data at: {}\".format (s3_input_test.config['DataSource']['S3DataSource']['S3Uri']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "preprocessing_trial_component = tracker.trial_component\n",
    "\n",
    "trial_name = f\"cc-fraud-training-job-{int(time.time())}\"\n",
    "cc_trial = Trial.create(\n",
    "    trial_name=trial_name,\n",
    "    experiment_name=cc_experiment.experiment_name,\n",
    "    sagemaker_boto_client=sm)\n",
    "\n",
    "cc_trial.add_trial_component(preprocessing_trial_component)\n",
    "cc_training_job_name = \"cc-training-job-{}\".format(int(time.time()))\n",
    "xgb = sagemaker.estimator.Estimator(\n",
    "    image,\n",
    "    role,\n",
    "    train_instance_count=1,\n",
    "    train_instance_type='ml.m4.xlarge',\n",
    "    train_max_run=3600,\n",
    "    output_path='s3://{}/{}/models'.format(output_bucket, prefix),\n",
    "    sagemaker_session=sess,\n",
    "    train_use_spot_instances=True,\n",
    "    train_max_wait=3600,\n",
    "    subnets=subnets, \n",
    "    security_group_ids=\n",
    "    sec_groups,  \n",
    "    train_volume_kms_key=cmk_id,\n",
    "    encrypt_inter_container_traffic=False\n",
    ")  \n",
    "\n",
    "xgb.set_hyperparameters(\n",
    "    max_depth=5,\n",
    "    eta=0.2,\n",
    "    gamma=4,\n",
    "    min_child_weight=6,\n",
    "    subsample=0.8,\n",
    "    verbosity=0,\n",
    "    objective='binary:logistic',\n",
    "    num_round=100)\n",
    "\n",
    "xgb.fit(\n",
    "    inputs={'train': s3_input_train},\n",
    "    job_name=cc_training_job_name,\n",
    "    experiment_config={\n",
    "        \"TrialName\":\n",
    "        cc_trial.trial_name,  #log training job in Trials for lineage\n",
    "        \"TrialComponentDisplayName\": \"Training\",\n",
    "    },\n",
    "    wait=True,\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Encryption\n",
    "\n",
    "To ensure that the processed data is encrypted at rest on the processing cluster, we provide a customer managed key to the volume_kms_key command below.  This instructs Amazon SageMaker to encrypt the EBS volumes used during the processing job with the specified key. Since our data stored in Amazon S3 buckets are already encrypted, data is encrypted at rest at all times.\n",
    "\n",
    "Amazon SageMaker always uses TLS encrypted tunnels when working with Amazon SageMaker so data is also encrypted in transit when traveling from or to Amazon S3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Use SageMaker Processing with SKLearn. -- combine data into train and test at this stage if possible.\n",
    "from sagemaker.sklearn.processing import SKLearnProcessor\n",
    "sklearn_processor = SKLearnProcessor(\n",
    "    framework_version='0.20.0',\n",
    "    role=role,\n",
    "    instance_type='ml.c4.xlarge',\n",
    "    instance_count=1,\n",
    "    network_config=network_config,  # attach SageMaker resources to your VPC\n",
    "    volume_kms_key=cmk_id  # encrypt the EBS volume attached to SageMaker Processing instance\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Job Name:  sagemaker-scikit-learn-2020-12-16-21-19-58-652\n",
      "Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://ds-data-bucket-dev-dev-0ab3a23e45a7/secure-sagemaker-demo/data', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://ds-data-bucket-dev-dev-0ab3a23e45a7/secure-sagemaker-demo/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]\n",
      "Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://ds-data-bucket-dev-dev-0ab3a23e45a7/secure-sagemaker-demo/train_data', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'test_data', 'S3Output': {'S3Uri': 's3://ds-data-bucket-dev-dev-0ab3a23e45a7/secure-sagemaker-demo/test_data', 'LocalPath': '/opt/ml/processing/test', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'train_data_headers', 'S3Output': {'S3Uri': 's3://ds-data-bucket-dev-dev-0ab3a23e45a7/secure-sagemaker-demo/train_headers', 'LocalPath': '/opt/ml/processing/train_headers', 'S3UploadMode': 'EndOfJob'}}]\n",
      "...........................\n",
      "\u001b[34m/miniconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses\n",
      "  import imp\u001b[0m\n",
      "\u001b[34mReceived arguments Namespace(random_split=0, train_test_split_ratio=0.2)\u001b[0m\n",
      "\u001b[34mReading input data from /opt/ml/processing/input/rawdata.csv\u001b[0m\n",
      "\u001b[34mRunning preprocessing and feature engineering transformations\u001b[0m\n",
      "\u001b[34mTrain data shape after preprocessing: (24000, 23)\u001b[0m\n",
      "\u001b[34mTest data shape after preprocessing: (6000, 23)\u001b[0m\n",
      "\u001b[34mSaving training features to /opt/ml/processing/train/train_data.csv\u001b[0m\n",
      "\u001b[34mComplete\u001b[0m\n",
      "\u001b[34mSave training data with headers to /opt/ml/processing/train_headers/train_data_headers.csv\u001b[0m\n",
      "\u001b[34mSaving test features to /opt/ml/processing/test/test_data.csv\u001b[0m\n",
      "\u001b[34mComplete\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "from sagemaker.processing import ProcessingInput, ProcessingOutput\n",
    "\n",
    "sklearn_processor.run(\n",
    "    code=codeupload,\n",
    "    inputs=[\n",
    "        ProcessingInput(\n",
    "            source=raw_data_location, \n",
    "            destination='/opt/ml/processing/input'\n",
    "        )\n",
    "    ],\n",
    "    outputs=[\n",
    "        ProcessingOutput(\n",
    "            output_name='train_data',\n",
    "            source='/opt/ml/processing/train',\n",
    "            destination=train_data_location),\n",
    "        ProcessingOutput(\n",
    "            output_name='test_data',\n",
    "            source='/opt/ml/processing/test',\n",
    "            destination=test_data_location),\n",
    "        ProcessingOutput(\n",
    "            output_name='train_data_headers',\n",
    "            source='/opt/ml/processing/train_headers',\n",
    "            destination=train_header_location)\n",
    "    ],\n",
    "    arguments=['--train-test-split-ratio', '0.2'])\n",
    "\n",
    "preprocessing_job_description = sklearn_processor.jobs[-1].describe()\n",
    "\n",
    "output_config = preprocessing_job_description['ProcessingOutputConfig']\n",
    "for output in output_config['Outputs']:\n",
    "    if output['OutputName'] == 'train_data':\n",
    "        preprocessed_training_data = output['S3Output']['S3Uri']\n",
    "    if output['OutputName'] == 'test_data':\n",
    "        preprocessed_test_data = output['S3Output']['S3Uri']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model development and Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Store the values used in this notebook for use in the second demo notebook:\n",
    "trial_name = trial_name  \n",
    "experiment_name = cc_experiment.experiment_name\n",
    "training_job_name = cc_training_job_name\n",
    "%store trial_name \n",
    "%store experiment_name \n",
    "%store training_job_name"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Edit Metadata",
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
